Abstract

The high-intensity, repetitive noise associated with functional magnetic
resonance imaging hinders both on-line monitoring of subjects' speech
and the recording of speech signals suitable for off-line analysis. The
proposed algorithm enhances the speech signal by suppressing the scanner
noise in the signal recorded by a single-channel microphone. Significant
increases in signal-to-noise ratio are achieved using an adaptive
filter that combines time- and frequency-domain elements. In addition
to providing a recording suitable for speech analysis, such a real-time
system provides an alternative means (to, e.g., the "panic ball") of
communication between the patient and the operator during image
acquisition.
I. Introduction

During a functional magnetic resonance imaging (fMRI) experiment, loud noise generated
by the gradient coils of the scanner typically accompanies the acquisition of brain
images (Ravicz et al., 2000). The intensity of this noise can vary between 100 and 120 dBA
SPL and the energy content is mostly concentrated in frequencies below 3 kHz, which is
also the frequency range most relevant to speech signals. Therefore, such noise is particularly
detrimental to recording speech in the scanner. Being able to record speech is not only
important for a variety of speech- and language-related studies, but also provides a natural
mechanism for a person in the scanner to communicate with the operator in the control
room. Prior efforts to reduce such noise utilized frequency-domain spectral subtraction or
time-domain template subtraction approaches (Table I).



The spectral subtraction approach (Nelles et al., 2003) removes stationary noise from a
measurement by subtracting a noise magnitude spectrum estimate from successive short-time
spectral estimates of the overall signal, and inverting using the overall signal phase
information (Boll, 1979). The effectiveness of this approach relies on separation between the
spectral properties of the speech signal and the noise source, and it is limited in this scenario,
where the scanner noise spectrum overlaps the speech spectrum. Template-based subtraction
(Cusack et al., 2005; Jung et al., 2005) eliminates characteristic noise signals by simple
subtraction of a time-domain template from the overall measurement in noise-correlated
time frames. Such an approach requires a high temporal sampling rate for accurate template
matching. Both approaches in their simplest form assume that the noise properties are constant.
The algorithm proposed here combines time- and frequency-domain elements into an adaptive
approach implemented as a real-time software program that processes the acoustic signal
acquired using off-the-shelf hardware components.

The details of the algorithm are presented, followed by the results of simulations using 
synthetic data and of using the system during an fMRI experiment. 



Ghosh, Noise Suppression for speech during fMRI 



II. Background 

The signal, y(t), recorded by a microphone placed near the mouth of the subject in the
scanner comprises three different components: (1) the voice signal, v(t), if present; (2) the
scanner gradient noise, g(t); and (3) other extraneous noise sources, n(t), present in the
environment (e.g., the helium pump, breathing).

$$y(t) = v(t) + g(t) + n(t) \tag{1}$$

The goal of the real-time algorithm is to estimate and continually refine g(t) and to use
the estimate in order to recover v(t). Most of the energy content of g(t) is concentrated
below 5 kHz and the spectral components of g(t) overlap those of v(t), as shown in Figure 1.

Based on recorded scanner noise data, the algorithm assumes that

$$|g(t)| \gg |v(t)| > |n(t)|$$

and that g(t) is periodic. Currently, images during functional scanning are acquired in
a planar manner, with a stack of two-dimensional (2-D) images making up a 3-D image
volume. During an fMRI session, the role of gradient switching is to select the plane from
which to acquire the MR signal. It is this switching that generates the scanner "noise"
and hence the periodicity of g(t) is dependent on the time between the onset of two planar
acquisitions. Thus,

$$g(t) \approx g\left(t + \frac{T}{n}\right) \tag{2}$$

where T is the time taken to acquire n 2-D images to create the 3-D volume.
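
As a minimal illustration of Eq. 2, the following Python sketch builds a synthetic periodic signal and checks that windows one period (T/n) apart are strongly correlated, while windows half a period apart are not. The sampling rate, the values of T and n, and the white-noise "slice" waveform are assumptions chosen for illustration; they are not taken from the recordings described in this paper, and real gradient noise is not white.

```python
import numpy as np

def synth_gradient_noise(period, n_periods, jitter=0.01, seed=0):
    """Tile one random 'slice' waveform and add a little independent noise.

    Hypothetical stand-in for g(t), used only to illustrate periodicity.
    """
    rng = np.random.default_rng(seed)
    one_slice = rng.standard_normal(period)
    g = np.tile(one_slice, n_periods)
    return g + jitter * rng.standard_normal(g.size)

def normalized_corr(a, b):
    """Zero-mean normalized correlation of two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

fs = 8000        # assumed sampling rate (Hz)
T, n = 2.0, 32   # assumed volume acquisition time (s) and number of slices
period = int(round(fs * T / n))   # samples in T/n seconds

g = synth_gradient_noise(period, n_periods=8)
# Windows exactly one period apart should be almost identical (Eq. 2).
rho_one_period = normalized_corr(g[:period], g[period:2 * period])
# Windows half a period apart should be essentially uncorrelated here.
half = period // 2
rho_half_period = normalized_corr(g[:period], g[half:half + period])
```

The contrast between the two correlation values is what makes a correlation-based search for the period feasible.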

III. Algorithm 

The algorithm has three stages. The first stage involves a time-domain estimation in
order to initialize the noise template. In the second stage, the estimated template is matched
to the signal. In the third and final stage, the template is subtracted from the matched
segment. The template is updated if the matched segment does not contain speech. The
second and third stages are repeated iteratively over the duration of the recording. Table II lists
and describes the parameters and the signal vectors used in the algorithm, the details of
which are described next. It is important to note that the algorithm operates sequentially
on buffers of length N and has a processing delay $\tau$ (typically less than 200 ms) related to
the buffer length and the estimated duration of the scanner noise template.

A. Step 1. Gradient noise template estimation 

The key to effective suppression is an accurate initial estimation of the gradient noise
template vector g. A double sliding-window cross-correlation approach is used to estimate g, in
which the correlation between two adjacent windows of equal duration is calculated. This
calculation is repeated over a short range of incremental window durations, since the
periodicity can change by a few samples from slice to slice. The noise template, g, is set to
the samples in the window for which this cross-correlation exceeds a pre-specified threshold
$\theta_{xcorr}$. To maximize computational efficiency, the search for g can be constrained by the
periodicity of g(t) ($l_{est} = T/n$; see Eq. 2) and only window durations of $l_{est} \pm w$ are used.
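
The search described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: window lengths are given in samples, the first window always starts at the beginning of the signal (no sliding over start positions), and the best-scoring length is kept and then compared against the threshold. The function names, the threshold value, and the synthetic test signal are assumptions.

```python
import numpy as np

def normalized_corr(a, b):
    """Zero-mean normalized correlation of two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def estimate_template(y, l_est, w, theta_xcorr=0.9):
    """Return (template, correlation), or (None, best correlation) on failure.

    For each candidate length L in [l_est - w, l_est + w] (samples), the
    window y[:L] is correlated with the adjacent window y[L:2L]. The length
    with the highest correlation is accepted if it exceeds theta_xcorr.
    """
    best_len, best_rho = None, -1.0
    for L in range(l_est - w, l_est + w + 1):
        if L < 2 or 2 * L > len(y):
            continue  # not enough samples for two adjacent windows
        rho = normalized_corr(y[:L], y[L:2 * L])
        if rho > best_rho:
            best_len, best_rho = L, rho
    if best_len is None or best_rho < theta_xcorr:
        return None, best_rho
    return y[:best_len].copy(), best_rho

# Synthetic check: a 500-sample pattern repeated, initial guess off by 2.
rng = np.random.default_rng(1)
pattern = rng.standard_normal(500)
y = np.tile(pattern, 4) + 0.01 * rng.standard_normal(2000)
g, rho = estimate_template(y, l_est=498, w=5)
```

Restricting the candidate lengths to a few samples around the expected period keeps the number of correlations small, which is the efficiency argument made in the text.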

B. Step 2. Template matching 

Once g(t) has been estimated, it can be correlated with samples in the input audio buffer
to determine a match. Computation time can be reduced by computing the correlations
with lags between N + 1 and 2N, again leveraging the periodic property of g(t). A template
match occurs when the peak correlation over the span of lags exceeds a specified threshold
$\theta_{corr}$. Similar to prior time-domain approaches, the noise template is then subtracted from
the matched segment $x_b$ to yield the residual $x_{res}$.
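
A minimal sketch of this matching step is given below. For simplicity it evaluates every possible lag in the buffer rather than only the restricted lag range mentioned above, and it uses a normalized correlation so that the threshold is scale-independent; both choices, as well as the function names and threshold value, are assumptions made for illustration.

```python
import numpy as np

def normalized_corr(a, b):
    """Zero-mean normalized correlation of two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def match_template(buf, g, theta_corr=0.8):
    """Return (lag, x_b, x_res) for the best-matching segment, or None."""
    L = len(g)
    if len(buf) < L:
        return None
    # Correlate the template with the buffer at every admissible lag.
    rhos = np.array([normalized_corr(buf[k:k + L], g)
                     for k in range(len(buf) - L + 1)])
    k = int(np.argmax(rhos))
    if rhos[k] < theta_corr:
        return None               # no segment resembles the template
    x_b = buf[k:k + L]            # matched segment
    return k, x_b, x_b - g        # residual after time-domain subtraction

# Synthetic check: the template is embedded at sample 37 of the buffer.
rng = np.random.default_rng(2)
g = rng.standard_normal(200)
buf = np.concatenate([0.05 * rng.standard_normal(37), g,
                      0.05 * rng.standard_normal(50)])
lag, x_b, x_res = match_template(buf, g)
```

In this noise-only example the residual is zero; with speech present, the residual would carry the speech plus whatever the template failed to capture.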
C. Step 3. Template subtraction and update

If $x_b$ contained a speech signal and the template was a perfect match, then the residual
$x_{res}$ should contain only the speech signal. However, because of other noises n(t) in the
system, the residual may contain additional noise. To further enhance the speech
signal, we perform a weighted frequency-domain subtraction (Eq. 3) of the magnitudes of
the spectral components. The weighting function w provides a mechanism to fine-tune
the suppression. The estimated voice signal, $\hat{v}$, is recovered by taking the inverse Fourier
transform ($\mathcal{F}^{-1}$) of this magnitude spectrum combined with the phase information from
$\mathcal{F}x_{res}$ (Eq. 4). An ideal digital filter (d) may be incorporated at this stage to limit the
bandwidth of the signal. If $\hat{v}$ has minimal energy (RMS below $\theta_{rms}$), then most of the content
of the buffer was likely scanner noise, and the matched segment is used to update the template g
(Eq. 5). Equation 5 indicates that a larger value of $\gamma$ keeps the estimates of g similar over
longer periods of time. Thus $\gamma$ can be used to control the similarity between successive
updates of g. This dependence on prior estimates guards against periods of recording when the
estimate is not updated due to the presence of voice in the signal.

$$r = \left[\, |\mathcal{F}x_{res}| - \alpha\, w \circ |\mathcal{F}g| \,\right]_+ \tag{3}$$

$$\hat{v} = \Re\left( \mathcal{F}^{-1}\left( d \circ r \circ e^{\,j \angle \mathcal{F}x_{res}} \right) \right) \tag{4}$$

$$g = \gamma g + (1 - \gamma)\, x_b, \quad \text{when } \mathrm{RMS}(\hat{v}) < \theta_{rms} \tag{5}$$

In the equations above, $\circ$ denotes the Schur (element-wise) product and $[\cdot]_+$ indicates
half-wave rectification.
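
The third stage can be sketched in a few lines of NumPy. This is an illustrative reading of Eqs. 3 to 5 rather than the authors' code: it uses the real-input FFT (`np.fft.rfft` / `np.fft.irfft`), which returns a real signal directly and so plays the role of taking the real part of the inverse transform, and the default values of `alpha`, `gamma`, and `theta_rms` are assumptions. The weighting `w` and filter `d` may be scalars or arrays with one entry per `rfft` bin.

```python
import numpy as np

def suppress_and_update(x_b, g, w, alpha=1.0, d=1.0, gamma=0.9, theta_rms=0.01):
    """Weighted spectral subtraction followed by a conditional template update."""
    X_res = np.fft.rfft(x_b - g)          # spectrum of the time-domain residual
    G = np.fft.rfft(g)                    # spectrum of the noise template
    # Eq. 3: subtract weighted template magnitude, then half-wave rectify.
    r = np.maximum(np.abs(X_res) - alpha * w * np.abs(G), 0.0)
    # Eq. 4: recombine with the residual phase and return to the time domain.
    v_hat = np.fft.irfft(d * r * np.exp(1j * np.angle(X_res)), n=len(x_b))
    # Eq. 5: update the template only when the output is nearly silent.
    if np.sqrt(np.mean(v_hat ** 2)) < theta_rms:
        g = gamma * g + (1.0 - gamma) * x_b
    return v_hat, g

rng = np.random.default_rng(3)
g0 = rng.standard_normal(512)                          # stand-in template
v = 0.5 * np.sin(2 * np.pi * 5 * np.arange(512) / 512)  # stand-in voice

# Voice present: with w = 0 the output equals the residual; no update.
v_hat, g1 = suppress_and_update(g0 + v, g0, w=0.0)
# Noise only, slightly louder than the template: the template is updated.
v_quiet, g2 = suppress_and_update(1.001 * g0, g0, w=0.0)
```

With `w = 0` the frequency-domain step leaves the residual untouched, so the sketch reduces to plain time-domain template subtraction; a non-zero `w` removes additional template-shaped energy from the residual magnitude spectrum.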

IV. Hardware setup and data collection 

All recordings were made with a Shure condenser microphone (model no.: SM93). The
microphone was placed inside a foam windscreen and was mounted on the head coil a few centimeters
from the subject's mouth. All ferromagnetic components of the microphone (primarily in
the connectors) were stripped before using it inside the scanner room. The microphone cable
to the supplied preamplifier was rerouted through RF filters mounted on the scanner filter
panel. The preamplifier was connected to a MOTU audio device (model no.: 828mkII), which
supplied the microphone with the necessary phantom power. The setup did not introduce
any artifacts in the acquired images or degrade their quality. Data for simulations and testing
were collected on 3T Siemens scanners.

V. Simulations 

Recordings containing scanner noise only and speech only were used for simulations. The
formulae used for quantifying initial signal-to-noise ratio (SNR) and improved SNR (ISNR)
are listed in Table III.

A. Influence of weighting functions: w 

Frequency-domain weighting functions w differentially affect the amount of noise
suppression (NS) achieved on noise-only sequences. A weighting function w = 0 corresponds to
time-domain subtraction, while the function w = 1 is equivalent to increasing the suppression
parameter $\alpha$. Not surprisingly, for recovery of speech signals, the most effective weighting
function was one that suppressed higher-frequency content more than low-frequency content.
To quantify the difference, ISNR was measured by comparing the SNR before and
after noise suppression of synthesized data.
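
For concreteness, the weighting functions discussed here can be written as vectors over the FFT bins of a frame. The sketch below is an assumption-laden illustration: the frame length and sampling rate are arbitrary, the function and option names are hypothetical, and the frequency-proportional weighting is normalized so that the highest bin has weight 1, a scaling the text does not specify.

```python
import numpy as np

def weighting(kind, n_samples, fs):
    """Return weights for the rfft bins of an n_samples-long frame."""
    f = np.fft.rfftfreq(n_samples, d=1.0 / fs)   # bin frequencies in Hz
    if kind == "zero":     # w = 0: time-domain subtraction only
        return np.zeros_like(f)
    if kind == "one":      # w = 1: uniform, acts like a larger alpha
        return np.ones_like(f)
    if kind == "abs_f":    # w(f) = |f|, scaled here so the top bin equals 1
        return np.abs(f) / f.max()
    raise ValueError(f"unknown weighting: {kind}")

w = weighting("abs_f", n_samples=512, fs=16000)
```

A weighting that grows with frequency subtracts more of the template spectrum at high frequencies, which matches the observation that suppressing higher-frequency content more strongly worked best for speech.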

Single-word speech utterances (e.g., Mm. 1), v(t), were added to scanner noise g(t) with
different signal-to-noise ratios (SNR) and relative phases (e.g., Mm. 2 and Mm. 3). Two
different weighting functions were used: (1) w = 0; and (2) w(f) = |f|. Improvement in
signal-to-noise ratio (ISNR) was calculated over the utterance only. Figure 2 shows ISNR
as a function of the original SNR for the two weighting functions. The variance is computed
over different values of $\alpha$ and $\theta_{rms}$.
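
The construction of such test mixtures can be sketched as follows. The SNR and ISNR expressions used here are common norm-ratio definitions in dB and are assumed for illustration; they may differ in detail from the formulae in Table III. The signals are synthetic stand-ins, and the "cleaned" output is fabricated inside the example purely to exercise the ISNR computation, not to represent any result.

```python
import numpy as np

def snr_db(v, g):
    """SNR of clean signal v against noise g, in dB."""
    return 20.0 * np.log10(np.linalg.norm(v) / np.linalg.norm(g))

def mix_at_snr(v, g, target_db):
    """Scale g so that snr_db(v, scaled g) equals target_db; return mixture."""
    scale = np.linalg.norm(v) / (np.linalg.norm(g) * 10.0 ** (target_db / 20.0))
    g_scaled = scale * g
    return v + g_scaled, g_scaled

def isnr_db(v, y, v_hat):
    """Improvement in SNR: input error (y - v) versus output error (v_hat - v)."""
    return 20.0 * np.log10(np.linalg.norm(y - v) / np.linalg.norm(v_hat - v))

rng = np.random.default_rng(4)
v = np.sin(2 * np.pi * 3 * np.arange(1000) / 1000)   # stand-in for an utterance
g = rng.standard_normal(1000)                         # stand-in for scanner noise
y, g_scaled = mix_at_snr(v, g, target_db=-20.0)

# Hypothetical "cleaned" signal in which 90% of the noise amplitude is removed.
v_hat = v + 0.1 * (y - v)
improvement = isnr_db(v, y, v_hat)
```

Removing 90% of the noise amplitude corresponds to a tenfold reduction in error norm, that is, an ISNR of 20 dB under this definition.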

Mm. 1. Recorded speech used in simulation. This is a file of type "wav". Mm. 2. 
Synthesized mixture (SNR = -20 dB) used in simulation. This is a file of type "wav". Mm. 
3. Synthesized mixture (SNR = -5 dB) used in simulation. This is a file of type "wav". 

In general, a higher original SNR leads to a higher ISNR. However, at SNRs close to
0 dB, the speech signal disrupts the pulse-train-like nature of the scanner noise and violates the
$|g(t)| \gg |v(t)|$ assumption. This results in greater misalignment in Step 2 of the algorithm
and therefore reduces the amount of improvement in SNR. Figure 3 shows the acoustic results
from two of the simulations, one at -20dB SNR (Mm. 4) and the other at -5dB SNR (Mm. 
5). Results of processing actual recordings made in the scanner are available as Mm. 6 and 
Mm. 7. 

Mm. 4. Cleaned recording (-20 dB SNR) processed using the algorithm. This is a file of
type "wav". Mm. 5. Cleaned recording (-5 dB SNR) processed using the algorithm. This is
a file of type "wav". Mm. 6. Recorded speech in scanner. This is a file of type "wav". Mm.
7. Cleaned recording processed using the algorithm. This is a file of type "wav".

VI. Conclusion 

We have presented an adaptive, online algorithm that can be used to suppress scanner 
noise and thereby provide an effective channel of communication between the subject in the 
scanner and the operator. It also allows acquisition of verbal responses in fMRI studies. The 
algorithm combines time-domain and frequency-domain techniques for noise reduction and
can be implemented on a computer or a DSP board for specialized operation. The
numerical description of the noise cancellation process provides a generalized framework that
can be readily extended, and the parameters can be tuned for different pulse sequences
and other similar repetitive signals.
TABLE I. Different software-based approaches used in the past for cancellation of fMRI
noise. (T-Domain: time domain, F-Domain: frequency domain, NRT: near real-time)

Reference T-Domain F-Domain Adaptive NRT 



Nelles et al. (2003) x

Jung et al. (2005) x x 

Cusack et al. (2005) x 

Current Proposal x x x 
TABLE II. Algorithm parameters and signal vectors

Parameters
  $l_{est}$              Estimated duration of noise template (s); $l_{est} \approx T/n$
  $\pm w$                Variation in $l_{est}$ (s)
  $S_r$                  Sampling frequency (Hz)
  $N$                    Frame length ($0.025\,S_r$ samples)
  $\tau$                 Buffer length (samples); $\tau = N + 2(l_{est} + w)S_r$
  $\theta_{xcorr}$       Cross-correlation threshold for estimating template
  $\theta_{corr}$        Correlation threshold for estimating signal match to template
  $\alpha$               Noise spectrum scaling parameter
  $\theta_{rms}$         Template update threshold
  $\gamma$               Template update parameter

Signal vectors
  $g$                    Estimated noise template
  $x_b$                  Matched signal from buffer
  $\hat{v}$              Noise-suppressed signal
  $x_{res}$              $x_b - g$
  $\mathcal{F}x$         Fourier transform of $x$
  $w$                    Weighting function on the magnitude spectrum of the noise template
  $d$                    Digital filter (e.g., low-pass filter at 5 kHz)
TABLE III. Formulae for quantifying simulations

Value                          Formula
Noise suppression (NS)         $20 \log_{10} \dfrac{\|g - \hat{g}\|_2}{\|g\|_2}$
Signal-to-noise ratio (SNR)    $20 \log_{10} \dfrac{\|v\|_2}{\|g\|_2}$
Improvement in SNR (ISNR)      $20 \log_{10} \dfrac{\|g\|_2}{\|v - \hat{v}\|_2}$
List of Figures

FIG. 1 Comparison of power spectral density of speech (black) and scanner noise
(gray) during fMRI. The scanner noise was collected from a 3 Tesla Siemens
Trio scanner. The speech signal was recorded from a male speaker. Power
spectral density estimates were obtained using the Welch algorithm with a
window of 100 ms and an overlap of 80%.

FIG. 2 Improvement in signal-to-noise ratio (ISNR) shown for two weighting
functions over a range of SNRs.

FIG. 3 The left panel shows the time-domain waveforms and corresponding
spectrograms of the speech (the utterance "three") and the scanner noise signals.
The middle and the right panels of the top row show the synthesized
noise-corrupted signals (top: SNR = -20 dB and bottom: SNR = -5 dB). The middle
and the right panels of the bottom row show the noise-reduced waveforms,
spectrograms and the improvement in SNR (ISNR). Spectrograms have a
frequency range of 0-5 kHz.
FIG. 1. [Plot: Welch power spectral density estimate; horizontal axis: frequency (kHz).]
FIG. 3. [Panels: Scanner noise; Voice+Noise (SNR = -20 dB); Voice+Noise (SNR = -5 dB).]